(54) Title: INTEGRATED MULTILINGUAL BROWSER

(57) Abstract 

The disclosed system translates into different languages HTML documents (16)
available through the World Wide Web. HTML documents (16) are translated by
machine translation software (10) bundled in a browser (12). Alternatively,
documents are retrieved as needed, translated, and stored on a Web server so
user requests are serviced with a document that has been translated from a
different language. The disclosed invention expands usage of the Internet for
non-English speakers.
INTEGRATED MULTILINGUAL BROWSER

BACKGROUND AND SUMMARY OF THE INVENTION
The present invention relates generally to the field of electronic communication over a
computer network. Particularly, the present invention relates to the expansion of multi-lingual
electronic communication through translation services for documents and messages available
through the Internet.

The recent surge in media attention to the Internet, and especially the World Wide
Web, coupled with the continuing growth in home PC ownership, have resulted in a growing
diversity of the Internet user population. No longer is the Internet the province of software
experts; thousands of novice users have begun to come online each day. Software like
CompuServe's Web Browser lets users quickly connect to and find useful content online. This
phenomenon is not restricted to the United States or to English-speaking countries. Growth
in online usage in Europe and Asia is increasing even more quickly than in the U.S.

While interest in the online world is at a peak, a significant obstacle exists to broad
usage of the Internet for non-English speakers. The vast majority of Internet content is in
English, and is therefore inaccessible to users with other native languages. Translation of
Internet documents by a human translator is not a practical solution for two reasons. First,
human translation is costly and slow. A translator can typically produce 300-400 words per
hour at costs of 12¢ per word or more. Second, in order to have a translator convert Internet
documents to the user's native language, the user would have to download every document he
was interested in to provide it to the translator. This is a time-consuming process, and if the
user knows no English, he will not even be able to assess the relevance of the document before
downloading it. This would result in wasted time and translation costs since inevitably, some
of the documents selected will not prove to be worthwhile.

The present invention allows non-English speaking Internet users to access and
understand information available from the World Wide Web and related sources. Language
translation software (known as machine translation, or MT) is combined with Internet
software to allow non-English speaking Internet users to quickly generate translations of
online text. The process is automated and therefore, less costly and time-consuming than
human translation. Advantages of the present invention are explained further in relation to the
following detailed description of the invention, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS
Figures 1A and 1B comprise a screen shot of a World Wide Web page;

Figure 2 is an example of a hypertext document;

Figure 3 is an example of a hypertext document preprocessed according to the method
of the present invention;

Figure 4 illustrates a system for performing machine translation;

Figures 5A and 5B comprise an example of a preprocessed hypertext document
translated according to the method of the present invention;

Figure 6 is an example of a translated hypertext document postprocessed according to
the method of the present invention;

Figures 7A and 7B comprise a screen shot of a World Wide Web page that has been
translated according to the method of the present invention;

Figure 8 is a diagrammatic view of one embodiment of the present invention in which
machine translation is integrated into a Web browser; and

Figure 9 is a diagrammatic view of one embodiment of the present invention in which
pre-translated Web pages are accessible from a server.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENT(S)
Although the detailed description of a preferred embodiment focuses on automatic
translation of World Wide Web pages, the concept is adaptable to documents obtained from
other sources.

The World Wide Web (WWW or the Web) is a distributed information system that
may be accessed through a number of sources. It is comprised of software and a set of
protocols and conventions. Information on the Web may be accessed using a browser
program such as CompuServe's Web Browser. Browsers allow users to read documents and
to locate documents from other sources. They present an interface for interacting with the
system and they process requests on behalf of the user.

Information providers on the WWW make their information available through
programs that understand the HyperText Transfer Protocol (HTTP). Browsers assist users in
"visiting" Web sites where information is stored. Information is displayed in pages of text and
graphics called "Web Pages." An example of a Web page as viewed through CompuServe's
Web Browser is provided in Figures 1A and 1B. The Web page shown in Figures 1A and 1B
contains both text 14, 18 and graphics 10, 12, 16. The title bar 20, menu options 22, buttons
24, and document information 26 appearing at the top of the screen are part of the browser
used to view the Web page.

In most cases, information providers make information available through a Web server.
The server responds to information requests by delivering the requested information to the
user's browser for viewing. Some providers may make their information available through a
proxy server that converts information in one format to the format expected and understood
by the browser.

Documents available on the WWW and displayed by browsers are hypertext
documents. Hypertext is text that contains references (or "links," "hyperlinks," or "hot
spots") to other documents. The reference is similar to a footnote except the referenced
document may be accessed directly from the original document. The related document may be
viewed by selecting or clicking the mouse on the reference. The process of selecting
hyperlinks to view referenced documents may be referred to as "traversing the hyperlinks."
Unlike a footnote, references usually do not appear as shorthand descriptions of related
documents. Instead, references may be indicated by a combination of graphics, different fonts,
different colors for the text, underlining, the mouse pointer turning into a hand, etc. The
referenced documents may reside on different computers at different Web sites.

Hypertext documents are written in a "markup language" called Hypertext Markup
Language (HTML). HTML actually refers to both a document type and the markup language
that represents instances of the document type. A hypertext document contains general
semantics appropriate for representing display or presentation characteristics as well as
information from a wide range of domains. A hypertext document consists of a sequence or
stream of characters that comprise both data characters and markups. Markups are
syntactically delimited characters (such as "<," ">," etc.) added to the data characters to
define the document's structure. Markups thus have special meanings and may represent such
things as hypertext, news, mail, documentation, menus of options, and in-line graphics.
Markups may be combined with other characters or related values to create codes that also
have special meanings. Data characters are those characters in the document that are not
codes.

Figure 2 is the hypertext document that describes the Web page shown in Figures 1A
and 1B. Figure 2 shows the markups and related words (that comprise codes) as well as data
characters that may appear in a hypertext document. For example, the characters "<" and ">"
appearing throughout the document are markups. The characters "<" and ">" combined with
the word "head" ("<head>") 10 may be considered a code. Finally, the text "NLT Home" 10
that is not surrounded by markups or codes may be considered data characters.
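
The distinction between codes and data characters can be illustrated with a minimal sketch. It assumes, for illustration only, that a code is any run of characters from "<" through the next ">"; a real HTML parser must also handle comments, scripts, and attribute values that contain ">". The function name split_codes_and_data and the sample input are hypothetical and do not come from the figures.

```python
import re

# Assumption: a code is everything from "<" through the next ">".
CODE_PATTERN = re.compile(r"<[^>]*>")

def split_codes_and_data(stream):
    """Return a list of (kind, text) pairs, where kind is 'code' or 'data'."""
    pieces = []
    position = 0
    for match in CODE_PATTERN.finditer(stream):
        if match.start() > position:
            # Characters between two codes are data characters.
            pieces.append(("data", stream[position:match.start()]))
        pieces.append(("code", match.group()))
        position = match.end()
    if position < len(stream):
        pieces.append(("data", stream[position:]))
    return pieces

pieces = split_codes_and_data("<head><title>NLT Home</title></head>")
for kind, text in pieces:
    print(kind, repr(text))
```

For this input the only data characters are "NLT Home"; the four tags are reported as codes.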

As indicated by the brief description, HTML documents have a well-defined and
documented structure defined by a grammar. The codes in a HTML document convey
important information regarding both the display or presentation of the document itself as well
as related references and commands. Display and presentation information may include color
information, information about graphics that appear on the page, information about text that
appears on the page, etc. A HTML document is structured as a series of elements that are
identified by the language markups and codes. A document includes a head (consisting of a
title and other optional elements) and a body that is a text flow of paragraphs, lists, images,
and other elements. The various parts of the document may be identified by looking at the
markups or codes in the document. For example, referring again to Figure 2 which shows the
hypertext for Figures 1A and 1B, the document head contains the title "NLT Home" 10. An
image contained in the document is identified in the line

"<br><img src="file:///n|/iowebav/server/8100--l.l/server-l^nMge/ntl.jpg" height=60
width=640></center>" 12.

As may be apparent, the process of translating a HTML document requires
examination of each character in the document. Characters may be examined individually and in
combination to determine whether they are markups, codes, or data characters. To process a
document, the processing software examines the character stream that comprises the
document. The steps needed to translate a HTML document from one language to another
may be summarized as follows:

Step 1. Preprocess the HTML document by placing boundary markers around
HTML codes to be preserved during the translation process. The
translation software recognizes the boundary markers and does not
translate text and symbols appearing between the markers.

Step 2. Translate the preprocessed HTML document from the original language to
the target language.

Step 3. Postprocess the translated HTML document to remove the boundary
markers.

Step 1. The codes in a HTML document convey important information describing the
characteristics of the Web page. Referring again to Figure 2, an example of the type of
information contained in a hypertext document is shown. Certain information contained in the
document of Figure 2 may be interpreted by a Web browser so that to the browser user, the
images shown in Figures 1A and 1B appear. Certain information in the hypertext document is
preserved during the translation process so that the translated page has, in general, the same
appearance and behavior as the original page. Because HTML documents have a well-defined
and known structure described by a grammar, automated translation of a HTML document is
possible. The codes in the document may be discerned by the preprocessing software. Special
boundary markers placed in the document by the preprocessing software indicate to the
translation software that the intervening text should not be translated. Consequently, the
resulting page may have the same appearance and behavior as the original page.

Referring to Figure 3, an example of a preprocessed HTML document is shown. The
HTML document of Figure 3 is the preprocessed version of the HTML document shown in
Figure 2. In this example, the boundary markers used to identify the HTML codes are the
character pairs "{." and ".}". Any character or character combination that does not normally
occur in text may be used as a boundary marker. The line that appeared as
"<head><title>NLT Home</title></head>" in Figure 2 (10) is preprocessed in Step 1 to the line
"{.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}" in Figure 3 (10). Other lines in the
document are preprocessed similarly.
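
Step 1 can be sketched in a few lines, assuming the "{." and ".}" marker pair described above and, as a simplification, treating every "<...>" run as a code. The function name preprocess is hypothetical; the disclosure does not specify how the preprocessor is implemented.

```python
import re

OPEN_MARKER = "{."
CLOSE_MARKER = ".}"

# Assumption: a code is everything from "<" through the next ">".
CODE_PATTERN = re.compile(r"<[^>]*>")

def preprocess(html):
    """Wrap every HTML code in boundary markers; leave data characters alone."""
    return CODE_PATTERN.sub(
        lambda match: OPEN_MARKER + match.group() + CLOSE_MARKER, html
    )

marked = preprocess("<head><title>NLT Home</title></head>")
print(marked)
```

Run on the title line, this yields the marked form with "NLT Home" left outside the markers, so only that text is exposed to the translator.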

Step 2. Machine Translation (MT) software performs the translation of text from one
language to another language. There are many commercially available MT software packages.
Figure 4 is an illustration of a system in which MT software 10 takes as input text in one
language 12 and generates a rough draft translation of the text in another language 14 using an
electronic dictionary 16 and a set of linguistic and/or statistical rules encoded in the program
18. MT software can perform language conversion operations very quickly; in some cases, at
speeds of up to 3,000 words per minute. The translated texts are not high quality translations,
but they are usually adequate for understanding what the document is about.
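
As a rough comparison, the MT figure above can be set against the human translation rate of 300-400 words per hour quoted in the background. The arithmetic below uses only those two figures; the MT figure is an upper bound ("up to"), so the ratios are best-case estimates rather than measured results.

```python
HUMAN_WORDS_PER_HOUR = (300, 400)  # range quoted in the background section
MT_WORDS_PER_MINUTE = 3000         # upper bound quoted for MT software

mt_words_per_hour = MT_WORDS_PER_MINUTE * 60

# Best-case speedup against the fastest and the slowest human rate.
speedup_low = mt_words_per_hour / HUMAN_WORDS_PER_HOUR[1]
speedup_high = mt_words_per_hour / HUMAN_WORDS_PER_HOUR[0]

print(mt_words_per_hour, speedup_low, speedup_high)
```

At these figures MT handles 180,000 words per hour, between 450 and 600 times the human rate.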

Referring to Figures 5A and 5B, an example of a translated HTML document is
shown. The HTML document of Figures 5A and 5B is the translated version of the
preprocessed HTML document shown in Figure 3. As described above, the boundary markers
used to identify the HTML codes are the character pairs "{." and ".}". Consequently, the MT
software ignores all text that falls between the boundary markers. Data characters that are not
surrounded by boundary markers are translated by the MT software. The preprocessed line
that appeared as "{.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}" in Figure 3 (10) is
translated in Step 2 to the line "{.<head>.}{.<title>.}NLT Maison{.</title>.}{.</head>.}" in
Figure 5A (10).
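
The marker-skipping behavior of Step 2 can be sketched as follows. The one-entry dictionary and the word-for-word toy_translate function are hypothetical stand-ins for a real MT engine, which would apply linguistic and/or statistical rules; only the handling of the "{." and ".}" markers reflects the description.

```python
import re

# Matches one protected span: "{." through the nearest ".}".
PROTECTED = re.compile(r"\{\..*?\.\}", re.DOTALL)

# Hypothetical stand-in for the MT engine's electronic dictionary.
TOY_DICTIONARY = {"Home": "Maison"}

def toy_translate(text):
    """Word-for-word lookup; words not in the dictionary pass through unchanged."""
    return re.sub(
        r"[A-Za-z]+",
        lambda match: TOY_DICTIONARY.get(match.group(), match.group()),
        text,
    )

def translate_marked(marked):
    """Translate only the text that lies outside boundary markers."""
    out = []
    position = 0
    for match in PROTECTED.finditer(marked):
        out.append(toy_translate(marked[position:match.start()]))
        out.append(match.group())  # protected span is copied verbatim
        position = match.end()
    out.append(toy_translate(marked[position:]))
    return "".join(out)

translated = translate_marked("{.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}")
print(translated)
```

The tag names "head" and "title" are never offered to the dictionary, because they sit inside protected spans.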



Step 3. In the final step, postprocessing software removes boundary markers from the
translated document. Referring to Figure 6, an example of a postprocessed HTML document
is shown. The HTML document of Figure 6 is the postprocessed version of the translated
HTML document shown in Figures 5A and 5B. As described above, the boundary markers
used to identify the HTML codes are the character pairs "{." and ".}". During
postprocessing, these boundary markers are removed. The translated line that appeared as
"{.<head>.}{.<title>.}NLT Maison{.</title>.}{.</head>.}" in Figure 5A (10) is postprocessed
in Step 3 to the line "<head><title>NLT Maison</title></head>" in Figure 6 (10). The
postprocessed HTML document of Figure 6 may then be displayed by the browser as shown in
Figures 7A and 7B.
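
A minimal sketch of Step 3 follows, again assuming the "{." and ".}" marker pair. As a design choice of this sketch (not stated in the disclosure), markers are removed only where they directly enclose a "<...>" code, so that a stray "{." in translated text is left untouched. The function name postprocess is hypothetical.

```python
import re

# A marked code: "{." + "<...>" + ".}"; the code itself is captured in group 1.
MARKED_CODE = re.compile(r"\{\.(<[^>]*>)\.\}")

def postprocess(translated):
    """Strip the boundary markers, leaving the enclosed HTML codes in place."""
    return MARKED_CODE.sub(r"\1", translated)

restored = postprocess("{.<head>.}{.<title>.}NLT Maison{.</title>.}{.</head>.}")
print(restored)
```

The result is ordinary HTML again, with the translated title and the original tags.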

Figure 8 is a diagrammatic view of one embodiment of the present invention in which
machine translation is integrated into a Web browser. MT software 10 may be combined with
a browser 12 to allow the user 14 to rapidly and automatically translate online documents
from the World Wide Web 16 into his native language. The MT software 10 may be bundled
with the browser 12 to form an integrated multilingual browser. The user 14 of the
multilingual browser 16 selects the desired target language, (e.g. French if the user speaks
French), and the Web document retrieved by the browser 18 may be rapidly translated on-the-
fly with a mouse click. The Web Browser 12 then displays for the user 14 the translated
document 20. Optionally, the user may be able to update and edit parts of the MT software's
electronic dictionaries to include terminology common to the Web sites he visits.
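
One way a user-editable dictionary could be layered over the dictionary shipped with the MT software is sketched below. The layering scheme, the variable names, and the sample entries are assumptions made for illustration; the disclosure does not describe how dictionary edits are stored.

```python
from collections import ChainMap

# Hypothetical entries; a real MT dictionary would be far larger.
base_dictionary = {"home": "maison", "page": "page"}
user_dictionary = {}

# Lookups consult the user's entries first, then fall back to the base entries.
lookup = ChainMap(user_dictionary, base_dictionary)

# The user adds a new term and overrides an existing one.
user_dictionary["browser"] = "navigateur"
user_dictionary["home"] = "accueil"

print(lookup["home"], lookup["browser"], lookup["page"])
```

Because the user's entries live in a separate mapping, the shipped dictionary is never modified and the edits can be discarded or exported on their own.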

Although a document may be translated at the time that a user requests access to the
document, a document may also be "pre-translated" and stored in a cache for later retrieval
before a user seeks access to it. Documents that have been accessed at least once may also be
stored following translation. The advantage of storing documents that have been translated is
that delivery time to the user may be reduced. Although storing documents requires disk
space, it may represent a better use of system resources because documents that are accessed
frequently are translated once rather than every time they are accessed.

Figure 9 is a diagrammatic view of an alternative implementation in which pre-
translated Web pages are stored on a server 14. The translation software resides on a
translation server 14 (possibly the same machine as the Web server). Popular Web pages 24
are pre-translated and stored in a cache 28, with additional pages being added as they are
requested by users 20. The cache is a dynamic storage device with a finite capacity. New,
pretranslated pages are added to the cache, but pages will also be removed from the cache if
they are used infrequently or if there are constraints on storage capacity.
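
A finite-capacity cache of translated pages might look like the sketch below. The description says only that infrequently used pages are removed; the least-recently-used eviction policy, the class name TranslatedPageCache, and the example URLs are assumptions chosen for illustration.

```python
from collections import OrderedDict

class TranslatedPageCache:
    """Finite cache keyed by (url, language); evicts the least recently used page."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()

    def get(self, url, language):
        key = (url, language)
        if key not in self._pages:
            return None
        self._pages.move_to_end(key)  # mark as most recently used
        return self._pages[key]

    def put(self, url, language, translated_html):
        key = (url, language)
        self._pages[key] = translated_html
        self._pages.move_to_end(key)
        while len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # drop the least recently used page

cache = TranslatedPageCache(capacity=2)
cache.put("http://example.com/a", "fr", "<p>A</p>")
cache.put("http://example.com/b", "fr", "<p>B</p>")
cache.get("http://example.com/a", "fr")              # touch A so B becomes the oldest
cache.put("http://example.com/c", "fr", "<p>C</p>")  # capacity exceeded: B is evicted
```

Keying on both the address and the language lets the same page be cached in several target languages at once.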

In accessing the system, a user 10 sends to the Web Server 14 a request for a specific
page in a specific language 12. The Web Server 14 then sends a request to get the desired
page 16. The method for servicing the request depends on where the page is located. If the
page has been pre-translated 24 and stored in the cache of pages in multiple languages 28, it is
retrieved from the cache 26 and returned to the user in the requested language 30. If the page
has not been pre-translated, then the page is retrieved 20 from the World Wide Web 22,
translated into the requested language, and cached before being sent to the user 30.
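
The request flow just described can be sketched as a single function. The names serve_page, fake_fetch, and fake_translate are hypothetical, a plain dict stands in for the cache, and the fetch and translate callables are stubs so the example runs without network access or an MT engine.

```python
def serve_page(url, language, cache, fetch, translate):
    """Return the page in the requested language, translating on a cache miss."""
    key = (url, language)
    if key in cache:
        return cache[key]                       # pre-translated: serve from the cache
    original = fetch(url)                       # retrieve from the World Wide Web
    translated = translate(original, language)
    cache[key] = translated                     # cache before returning to the user
    return translated

# Stand-ins for demonstration only.
fetch_calls = []

def fake_fetch(url):
    fetch_calls.append(url)
    return "<title>NLT Home</title>"

def fake_translate(html, language):
    return html.replace("Home", "Maison") if language == "fr" else html

cache = {}
first = serve_page("http://example.com/", "fr", cache, fake_fetch, fake_translate)
second = serve_page("http://example.com/", "fr", cache, fake_fetch, fake_translate)
print(first, len(fetch_calls))
```

The second request is answered from the cache, so the page is fetched and translated only once.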

Translation of Web pages, in either the bundled browser/MT configuration or the Web
Server configuration, requires processing of HTML codes containing reference, command,
and display information. Preferably, the HTML codes are identified prior to translation, then
surrounded by special boundary markers to block the translation process on the codes. The
HTML preprocessor uses its knowledge regarding the markups, codes, data characters and the
structure of HTML documents to determine which codes should be blocked from the
translation process. After translation is complete, a postprocessing program removes the
special boundary markers so that the necessary references, commands, and display
characteristics are available in the translated text.

The primary objective of the present invention is to allow a user of the Internet to read
hypertext documents that are available only in a language foreign to the user. The readable
text of the hypertext document is changed in accordance with the user's preferred language.
Steps are taken to preserve the document's appearance and behavior so that the only
noticeable difference between the original document and the translated document is the
language of the text. Users may interact with the translated document and reference related
documents in the same manner that users interact with the original document.



WHAT IS CLAIMED IS:
1. A method for translating a document, comprising the steps of:

providing a character stream including codes and data characters in a first
language;

transmitting said character stream to a language translator;

recognizing said codes to prevent translation of said codes by said language
translator; and

translating a significant portion of said data characters into a second language
using said language translator.

2. The method of claim 1, wherein said codes are HyperText Markup Language codes.

3. The method of claim 1, wherein said step of recognizing said codes is performed by a
language translator preprocessor.

4. The method of claim 1, wherein boundary markers are placed around said codes to
prevent translation of said codes by said language translator.

5. The method of claim 1, wherein said language translator is integrated into a browser
program.

6. The method of claim 1, wherein said document is pretranslated.

7. The method of claim 1, further comprising the step of viewing said translated
HyperText Markup Language document with a browser.

8. A document translation system, comprising:

a character stream containing codes and data characters in a first language;

a preprocessor for marking codes in said character stream;

a language translator for translating into a second language said data characters
in said preprocessed character stream; and

a postprocessor for unmarking said codes in said translated character stream.

9. The system of claim 8, wherein said codes are defined using HyperText Markup
Language.

10. The system of claim 8, wherein said preprocessor, said language translator, and said
postprocessor are integrated into a document browser.

11. The system of claim 8, wherein said preprocessor, said language translator, and said
postprocessor are integrated into a Web server process.

12. The system of claim 8, further comprising a browser for viewing said translated
HyperText Markup Language document.

13. A method for translating documents, comprising the steps of:

providing in a first language a document containing display and reference codes
and data characters exclusive of said display and reference codes;

viewing and interacting with said document in said first language;

translating said data characters to a second language; and

viewing and interacting with said translated document in substantially the same
manner as said document in said first language.